Setup and load data

The data needed for this exercise is taken from a similar course, in particular it comes from this publication. We will work with the output from the differential expression analysis of KO vs WT. The data files should be in a data/ subdirectory in your R working directory. You can download the data/ directory with files here.

  1. Make a new directory for this exercise and add the data/ directory to it. Start R and use setwd() to change the working directory to the one you created.

  2. We will use the following R packages in this exercise:

library(knitr)
library(topGO)
library(biomaRt)
library(piano)
library(NMF)

If you are not able to load any of these packages, try to install them, either using biocLite() (first you need to run source("http://www.bioconductor.org/biocLite.R")) or install.packages(). If something fails, try to understand the error message and fix it. If you get stuck, ask for help :)

  1. Start by reading in the differential expression results from the previous exercise:
diffExpRes <- read.delim("data/results_DE.txt", stringsAsFactors=F)
head(diffExpRes[,c(1,3,5,8,9)]) # skip some columns
##      ensembl_gene_id mgi_symbol     logFC       PValue          FDR
## 1 ENSMUSG00000034810      Scn7a -2.860822 4.613775e-45 6.456978e-41
## 2 ENSMUSG00000024799     Tm7sf2 -2.166921 9.630122e-44 6.738678e-40
## 3 ENSMUSG00000027200     Sema6d -1.836609 5.145875e-42 2.400551e-38
## 4 ENSMUSG00000022483     Col2a1  1.880766 2.420791e-41 7.273966e-38
## 5 ENSMUSG00000022231     Sema5a -3.039413 2.598773e-41 7.273966e-38
## 6 ENSMUSG00000023832      Acat2 -1.934613 8.941421e-41 2.085587e-37

Question 1: How many genes are significant at a cutoff of FDR<0.001?

Web-based enrichment analysis

We will start by performing overrepresentation analysis (a.k.a. list enrichment analysis, …) by using Enrichr and DAVID. Both use a scoring method similar to the Hypergeomtric/Fisher’s exact test. To do such a test, we need to have a list of interesting (e.g. differentially expressed) genes and a list of the background (also known as universe). However, both Enrichr and DAVID have their own background lists, so we do not need to specify them explicitly. First, let’s make a list of interesting genes!

  1. You can use the write.table function to print the top significant genes to copy paste into some web browser tools:
sel <- diffExpRes$ensembl_gene_id[diffExpRes$FDR<0.001]
write.table(sel, quote=F, row.names=F, col.names=F)

DAVID

  1. Select and copy the gene list given by the above command and head over to DAVID.
  2. Locate Functional Annotation in the left menu and click the more link and briefly scroll through that page to get an overview of what the Functional Annotation Tool at DAVID does.

Question 2: What is the alternative to the Fisher Exact P-Value used by DAVID?

  1. Now go to the Functional Annotation page and make sure the Upload tab is visible.
  2. Paste the copied gene-list, select the correct identifier, and select whether this is a gene list or background (discuss with other students if you are not sure, or ask the instructors). Submit list.

Question 3: Where all gene IDs recognized?

  1. Explore the results. For instance, click on Functional Annotation Clustering at the bottom of the page. This shows related gene-sets clustered together in larger groups for a nicer overview.

Question 4: What seems to be the main functions of the top significant genes? Does this make sense considering the experimental design?

  1. Now, rerun DAVID but use gene symbols as input instead. Hint: you need to modify the command above creating the sel object.

Question 5: Where all gene IDs recognized?
Question 6: Does this list give the same results as using the Ensembl IDs?

Enrichr

As an alternative to DAVID we can try Enrichr. This page takes human gene symbols as input, so first we need to translate our mouse Ensembl IDs to that. This can be done in many ways, one way is to use the biomaRt package (find a user guide here or type vignette("biomaRt")):

# Use biomaRt to translate the mouse genes into human orthologs:
# Select the Ensembl BioMart database
ensembl <- useMart("ENSEMBL_MART_ENSEMBL")
# Select the mouse dataset and update the Mart object: 
ensembl <- useDataset("mmusculus_gene_ensembl", mart=ensembl)
# Make the query:
bm <- getBM(attributes=c("ensembl_gene_id","hsapiens_homolog_ensembl_gene"), # this is what we want to extract 
            filters="ensembl_gene_id", # this determines the filter
            values=sel, # this are the values to filter for (get only the genes in our list)
            mart=ensembl)
head(bm)
##      ensembl_gene_id hsapiens_homolog_ensembl_gene
## 1 ENSMUSG00000000093               ENSG00000121068
## 2 ENSMUSG00000000120               ENSG00000064300
## 3 ENSMUSG00000000184               ENSG00000118971
## 4 ENSMUSG00000000223               ENSG00000102385
## 5 ENSMUSG00000000253               ENSG00000137198
## 6 ENSMUSG00000000416               ENSG00000077063

As you can see, the above code gave us a map between mouse and human ensembl gene IDs. But this is not exactly what we wanted, we need human gene symbols.

  1. Change the attributes argument to the proper setting.

Use the function listAttributes(ensembl) to see all possible options. Hint: if there are too many options to go through you can use grep to pull out the relevant ones, e.g.:

tmp <- listAttributes(ensembl)
tmp[grep("Human",tmp[,2]),]

If you are working in RStudio you can also use the convenient command View and then search for human, e.g.:

View(listAttributes(ensembl))

Try it also for the bm object. For future reference, note that you can also use listDatasets(ensembl) to see which datasetes you can choose from, try it!

Question 7: How many of the genes in our selected list have gene symbols?

  1. Get familiar with the duplicated and unique functions:
# A vector to test stuff on:
tmp <- c(1,2,2,3,4,4,4,5)
# Get unique elements:
unique(tmp)
## [1] 1 2 3 4 5
# Which elements are duplicated?
duplicated(tmp)
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
# How many duplicates are there:
sum(duplicated(tmp))
## [1] 3
# Which are the duplicated elements:
tmp[duplicated(tmp)]
## [1] 2 4 4
# Which are the unique duplicated elements:
unique(tmp[duplicated(tmp)])
## [1] 2 4

Make sure you understand how these two functions work and what they do, then move ahead with the following steps and questions:

Question 8: Are there duplicates Ensembl IDs or gene symbols in the list for Enrichr (bm)? How can this affect the results that we get?

  1. Now, run getBM again, this time also adding information about % identity between mouse and human genes.
  2. Make a boxplot of the % identity for our gene list. Hint: try ?boxplot if you are unsure about this.

Question 9: Looking at the boxplot, are there any reasons to be concerned?
Question 10: Which gene in the list has the lowest % identity?

  1. Paste your list of human gene symbols on the Enrichr website. Look at Ontologies > GO Biological Process 2015.

Question 11: Are the results similar to those we got from DAVID?

  1. Look around on the Enrichr page and get familiar with the different results. Look also on the different views, e.g. Table is good to see the actual p-values of the results.

Question 12: What do the edges (links) in the network denote?
Question 13: We have this far used a cutoff of FDR<0.001. Try a few different cutoffs and rerun DAVID and Enrichr. How different are the results?
Question 14: For the Enrichr and DAVID results so far, can we know whether the identified functions are active/inactive or up/down-regulated in KO vs control? If not, how can we go about to get a clue about that?

Enrichment analysis in R

Visualizing GO-terms with TopGO

Above, we have looked at some different types of gene-set collections, but a lot of focus has been on GO-terms.

  1. Use the internet to learn about what GO-terms are (e.g. Wikipedia or here).

Question 15: What are the three main domains/ontologies?
Question 16: By whom and how are GO-terms maintained/defined/updated?
Question 17: What is the evidence code?
Question 18: How are GO-terms connected to each other?

We can use e.g. the topGO package to visualize gene-set analysis results on the GO hierarchy network (topGO manual).

  1. Run the code below, it takes a short while to complete. Meanwhile, go through the code and understand what it does.
# Format for topgo - remove genes without ensembl id:
d_topgo <- diffExpRes[!is.na(diffExpRes$ensembl_gene_id),]
rownames(d_topgo) <- d_topgo$ensembl_gene_id

# Create the topGOdata object:
GOobj <- new("topGOdata",
             description = "Simple session", # just a name, can be changed
             ontology = "BP", # set to BP, MF, or CC
             allGenes = setNames(diffExpRes$FDR, diffExpRes$ensembl_gene_id), # named numeric vector
             geneSel = function(x) x<0.01, # a function to select genes from the allGenes vector
             nodeSize = 10, # remove GO-terms with less than this number of genes
             mapping = "org.Mm.eg.db", # annotation database
             ID = "ensembl", # gene ID used as names in allGenes vector
             annot = annFUN.org) 
# Run GO-term enrichment analysis:
resultFisher <- runTest(GOobj, algorithm = "classic", statistic = "fisher")
# Plot results using the GO hierarchical DAG:
showSigOfNodes(GOobj, score(resultFisher), firstSigNode=15, useInfo ='all')

If you save the plot as a PDF it will be easier to zoom in and read the node text. (Sometimes the plotting misbehaves in RStudio, try running grid.newpage(), plot.new(), and/or par(mfcol=c(1,1)) if plotting looks weird.)

Question 19: What are the circles and squares denoting?
Question 20: Which is the most significant BP GO-term?

  1. Now make a topGO plot for the top 10 significant Cellular Compartments (CC), based on a gene list using a cutoff of FDR<0.001. It should look like the one below.

Question 21: Which is the most significant CC GO-term?

Piano

Piano is an R-package for carrying out gene-set analysis (see more info here).

  1. First we need to import some gene-sets. Load GO-terms from MSigDB:
gscGO <- loadGSC("data/GSC/c5.bp.v5.2.symbols.gmt")
gscGO
## First 50 (out of 4653) gene set names:
##  [1] "GO_REGULATION_O..." "GO_LACTATE_TRAN..." "GO_POSITIVE_REG..."
##  [4] "GO_DETECTION_OF..." "GO_CARDIAC_CHAM..." "GO_CALCIUM_ION_..."
##  [7] "GO_DNA_DEPENDEN..." "GO_TRNA_AMINOAC..." "GO_CIRCADIAN_RH..."
## [10] "GO_PHOSPHATIDYL..." "GO_SPINAL_CORD_..." "GO_REGULATION_O..."
## [13] "GO_PLATELET_DER..." "GO_GLUCURONATE_..." "GO_CELLULAR_RES..."
## [16] "GO_REGULATION_O..." "GO_POSITIVE_REG..." "GO_CEREBRAL_COR..."
## [19] "GO_POSITIVE_REG..." "GO_NEGATIVE_REG..." "GO_POTASSIUM_IO..."
## [22] "GO_L_PHENYLALAN..." "GO_REGULATION_O..." "GO_CARDIAC_MUSC..."
## [25] "GO_NEGATIVE_REG..." "GO_MOVEMENT_IN_..." "GO_REGULATION_O..."
## [28] "GO_APICAL_PROTE..." "GO_REGULATION_O..." "GO_FOREBRAIN_NE..."
## [31] "GO_POSITIVE_REG..." "GO_NEUROMUSCULA..." "GO_MITOTIC_CYTO..."
## [34] "GO_NEGATIVE_REG..." "GO_SMAD_PROTEIN..." "GO_CYTOPLASMIC_..."
## [37] "GO_MEIOTIC_CHRO..." "GO_POSITIVE_REG..." "GO_REGULATION_O..."
## [40] "GO_RNA_DEPENDEN..." "GO_REGULATION_O..." "GO_REGULATION_O..."
## [43] "GO_DENDRITE_DEV..." "GO_REGULATION_O..." "GO_G_PROTEIN_CO..."
## [46] "GO_MEMORY"          "GO_NEURON_DEVEL..." "GO_REGULATION_O..."
## [49] "GO_ENDOTHELIAL_..." "GO_POSITIVE_REG..."
## 
## First 50 (out of 15578) gene names:
##  [1] "PDE1B"    "NR4A2"    "MAOB"     "DRD4"     "TACR3"    "PARK2"   
##  [7] "COMT"     "HTR1A"    "GPR37"    "HPRT1"    "CHRNB2"   "SNCA"    
## [13] "SLC6A3"   "ABAT"     "DRD1"     "PNKD"     "PARK7"    "SLC16A8" 
## [19] "SLC16A5"  "SLC16A4"  "SLC5A12"  "EMB"      "SLC16A1"  "SLC16A7" 
## [25] "SLC16A12" "SLC16A13" "SLC16A6"  "SLC16A3"  "SLC16A11" "POLR2C"  
## [31] "POLR2J"   "CTDP1"    "RDBP"     "COBRA1"   "RSF1"     "POLR2B"  
## [37] "POLR2D"   "POLR2G"   "POLR2F"   "POLR2A"   "SNW1"     "CHD1"    
## [43] "POLR2K"   "CDK9"     "JUN"      "POLR2E"   "CCNT1"    "LEF1"    
## [49] "POLR2H"   "GTF2F2"  
## 
## Gene set size summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.0    17.0    34.0   111.5    92.0  1984.0 
## 
## No additional info available.

These are annotated with human gene symbols.

  1. Add these IDs to the data:
ensembl <- useMart("ENSEMBL_MART_ENSEMBL")
ensembl <- useDataset("mmusculus_gene_ensembl", mart=ensembl)
bm <- getBM(attributes=c("ensembl_gene_id","hsapiens_homolog_associated_gene_name"),
            filters="ensembl_gene_id",
            values=diffExpRes$ensembl_gene_id, # note that now we use all genes here!
            mart=ensembl)
bm <- bm[bm[,2]%in%unique(unlist(gscGO)),] # filter for genes in the GO gene-set collection
head(bm)
##      ensembl_gene_id hsapiens_homolog_associated_gene_name
## 1 ENSMUSG00000000001                                 GNAI3
## 2 ENSMUSG00000000028                                 CDC45
## 6 ENSMUSG00000000078                                  KLF6
## 7 ENSMUSG00000000085                                 SCMH1
## 8 ENSMUSG00000000088                                 COX5A
## 9 ENSMUSG00000000093                                  TBX2

Question 22: Are there any duplicates in bm? How many? What does this mean?

  1. Let’s take a simple approach and remove all ENSMUS duplicates in bm:
bm <- bm[!duplicated(bm[,1]),]

Note that multiple ENSMUS IDs still can map to the same gene symbol!

  1. Extract the gene-level statistics that we will use:
# merge diffExpRes and bm:
d_piano <- merge(diffExpRes, bm, by="ensembl_gene_id", all.x=T, sort=F)
# get geneNames, pvals and logfc:
geneNames <- d_piano$hsapiens_homolog_associated_gene_name
pvals <- d_piano$FDR
names(pvals) <- geneNames
logfc <- d_piano$logFC
names(logfc) <- geneNames

Question 23: How many unique mouse Ensembl IDs could we map to human orthologs? How many unique human orhtologs do we map to?

Overrepresentation analysis

First, let’s use piano to perform a hypergeometric test, similar to what we did with Enrichr, DAVID, and topGO.

  1. Run the following code:
# Get the gene-list:
sel <- names(pvals)[pvals<0.001]
sel <- unique(sel[!is.na(sel)])
# Run the analysis:
res <- runGSAhyper(genes=sel, universe=unique(names(pvals)), gsc=gscGO, gsSizeLim=c(20, 200))
# Visualize results:
networkPlot(res, class="non", adjusted=T, significance=0.0001, ncharLabel=Inf,
            nodeSize=c(3,20), edgeWidth=c(1,15), cexLabel=0.7, overlap=20, 
            scoreColors=c("greenyellow","darkgreen"))

Here, we use the networkPlot function which draws a network for the most significant gene-sets.

Question 24: What do the colors, node-sizes, and edges denote?
Question 25: Are the results in line with those from topGO, DAVID, and Enrichr?

Gene-set analysis

Next, we will use piano to perform gene-set analysis, i.e. not the enrichment of a gene-list based on a cut-off, but using all available gene-level statistics. We will use signed (to keep track of fold-change) -log10-pvalues to rank the gene-sets, i.e. -log10(pvals)*sign(logfc).

  1. Run piano and plot the results:
gsaRes <- runGSA(-log10(pvals)*sign(logfc), geneSetStat="fgsea", gsc=gscGO, gsSizeLim=c(20, 200))

networkPlot(gsaRes, "distinct", "both", adjusted=T, significance=0.05, ncharLabel=Inf,
            nodeSize=c(3,15), edgeWidth=c(1,8), cexLabel=0.7, overlap=20,
            scoreColors=c("red", "orange", "yellow", "blue", "lightblue", "lightgreen"))

Question 26: Could gene overlap between you gene-sets bias the result interpretation in any way?
Question 27: What is the top GO-term affected by down-regulation and up-regulation, respectively?

  1. Now run a second GSA, according to the following:
  • This time use the file data/GSC/c2.cp.v5.2.symbols.gmt, which contains pathway gene-sets from MSigDB, as a gene-set collection (load it with loadGSC()).
  • This time use the mean method (option geneSetStat="mean") instead of fgsea.
  • Use only gene-sets containing 10-100 genes.
  1. Plot the results using a network plot, play around with the options until you think it looks nice and presents the results from the GSA in a good way.

For example it could look something similar to this:

Question 28: Are these results in line with previous results we have seen in this exercise? Do you feel there is a consensus pattern?

It is always good to go back to the gene-level while looking at your final GSA results. Here we will use a heatmap for visualizing gene-level data for selected gene-sets (using one of many heatmap functions available for R).

  1. Let’s do so for a specific gene-set:
# Get genes for a specific gene-set:
selGenes <- names(geneSetSummary(gsaRes, "GO_STEROID_BIOSYNTHETIC_PROCESS")$geneLevelStats)

# Merge with the diffexp result table:
selTab <- merge(cbind(selGenes), d_piano[,c(1,3,5,9,10)], 
                by.x=1, by.y="hsapiens_homolog_associated_gene_name", all.x=T)
selTab$log10FDR <- -log10(selTab$FDR)*sign(selTab$logFC)
# Display this table:
kable(selTab)
selGenes ensembl_gene_id mgi_symbol logFC FDR log10FDR
ACAA2 ENSMUSG00000036880 Acaa2 0.3227222 0.0729742 1.1368305
ACBD3 ENSMUSG00000026499 Acbd3 -0.2117000 0.8284461 -0.0817357
ACLY ENSMUSG00000020917 Acly -1.0100942 0.0000000 -14.8512572
ACOT8 ENSMUSG00000017307 Acot8 0.1464229 0.8183638 0.0870536
AKR1B15 ENSMUSG00000061758 Akr1b10 0.4710835 0.2556798 0.5923037
AKR1B15 ENSMUSG00000029762 Akr1b8 0.4084569 0.2292405 0.6397086
AKR1B15 ENSMUSG00000061758 Akr1b10 0.4710835 0.2556798 0.5923037
AKR1B15 ENSMUSG00000029762 Akr1b8 0.4084569 0.2292405 0.6397086
AKR1C3 ENSMUSG00000021213 Akr1c13 0.1587939 0.9247777 0.0339627
AKR1C4 ENSMUSG00000021213 Akr1c13 0.1587939 0.9247777 0.0339627
AMACR ENSMUSG00000022244 Amacr 0.1090078 0.8615007 0.0647444
ARV1 ENSMUSG00000031982 Arv1 -0.0702522 0.9464355 -0.0239090
C14orf1 ENSMUSG00000021252 0610007P14Rik -0.9167006 0.0000000 -9.7075763
CACNA1H ENSMUSG00000024112 Cacna1h 0.6248563 0.0628600 1.2016254
CNBP ENSMUSG00000030057 Cnbp 0.0848649 0.8070392 0.0931054
CYB5R1 ENSMUSG00000026456 Cyb5r1 0.4548193 0.1030950 0.9867626
CYB5R3 ENSMUSG00000018042 Cyb5r3 -0.2805545 0.1706601 -0.7678679
CYP11A1 ENSMUSG00000032323 Cyp11a1 0.6369639 0.5239671 0.2806960
CYP27A1 ENSMUSG00000026170 Cyp27a1 0.1075853 0.9047557 0.0434687
CYP2R1 ENSMUSG00000030670 Cyp2r1 -0.2493062 0.8816242 -0.0547165
CYP39A1 ENSMUSG00000023963 Cyp39a1 -1.0522696 0.0000000 -9.8271986
CYP3A4 ENSMUSG00000029727 Cyp3a13 0.0295973 0.9806181 0.0085001
CYP46A1 ENSMUSG00000021259 Cyp46a1 -0.4376909 0.6736854 -0.1715429
CYP51A1 ENSMUSG00000001467 Cyp51 -1.6531937 0.0000000 -20.8889340
CYP7B1 ENSMUSG00000039519 Cyp7b1 -0.0504172 0.8895779 -0.0508160
DHCR24 ENSMUSG00000034926 Dhcr24 -1.7828603 0.0000269 -4.5701737
DHCR7 ENSMUSG00000058454 Dhcr7 -0.8584124 0.0000000 -8.8078473
EBP ENSMUSG00000031168 Ebp -0.5232713 0.0206938 -1.6841595
FDFT1 ENSMUSG00000021273 Fdft1 -1.4720779 0.0000000 -22.7374916
FDPS ENSMUSG00000059743 Fdps -2.0806552 0.0000232 -4.6340487
FDX1 ENSMUSG00000032051 Fdx1 0.3163207 0.4555077 0.3415043
FDXR ENSMUSG00000018861 Fdxr -0.2715724 0.4486059 -0.3481350
GGPS1 ENSMUSG00000021302 Ggps1 -0.1880611 0.7162604 -0.1449291
HINT2 ENSMUSG00000028470 Hint2 0.1070335 0.8354226 0.0780938
HMGCR ENSMUSG00000021670 Hmgcr -1.9043148 0.0000000 -22.9236949
HMGCS1 ENSMUSG00000093930 Hmgcs1 -2.1349687 0.0000000 -21.3281530
HMGCS2 ENSMUSG00000027875 Hmgcs2 0.4020805 0.0251885 1.5987971
HSD11B2 ENSMUSG00000031891 Hsd11b2 1.0354925 0.1483534 0.8287025
HSD17B11 ENSMUSG00000029311 Hsd17b11 0.2584276 0.2948881 0.5303427
HSD17B12 ENSMUSG00000027195 Hsd17b12 -0.4645855 0.0014895 -2.8269590
HSD17B14 ENSMUSG00000030825 Hsd17b14 -0.4382260 0.7236379 -0.1404787
HSD17B4 ENSMUSG00000024507 Hsd17b4 -0.0393758 0.9229765 -0.0348093
HSD17B7 ENSMUSG00000026675 Hsd17b7 -1.4288825 0.0000000 -9.6138912
HSD3B7 ENSMUSG00000042289 Hsd3b7 0.0582406 0.9228457 0.0348709
IDI1 ENSMUSG00000058258 Idi1 -1.8497690 0.0000000 -7.7800272
INSIG1 ENSMUSG00000045294 Insig1 -1.4724813 0.0000000 -20.1443378
INSIG2 ENSMUSG00000003721 Insig2 -0.0976972 0.8562685 -0.0673900
LBR ENSMUSG00000004880 Lbr -0.2599271 0.3926051 -0.4060441
LSS ENSMUSG00000033105 Lss -2.3815805 0.0000018 -5.7332717
MED1 ENSMUSG00000018160 Med1 0.1449697 0.5948939 0.2255605
MSMO1 ENSMUSG00000031604 Msmo1 -1.9228029 0.0000000 -30.1111731
MVK ENSMUSG00000041939 Mvk -1.0533312 0.0000000 -12.1796821
NSDHL ENSMUSG00000031349 Nsdhl -1.3307947 0.0000000 -14.8501597
PBX1 ENSMUSG00000052534 Pbx1 0.0237174 0.9547740 0.0200994
PMVK ENSMUSG00000027952 Pmvk -1.0456293 0.0000000 -10.2062817
PRKAA1 ENSMUSG00000050697 Prkaa1 0.7674139 0.3298035 0.4817447
PRKAA2 ENSMUSG00000028518 Prkaa2 1.0590779 0.0226386 1.6451495
PRKAG2 ENSMUSG00000028944 Prkag2 -0.5878811 0.0552485 -1.2576795
SCARB1 ENSMUSG00000037936 Scarb1 -0.3765647 0.0507384 -1.2946633
SCP2 ENSMUSG00000028603 Scp2 -0.2126157 0.8743149 -0.0583321
SDR42E1 ENSMUSG00000034308 Sdr42e1 0.3786522 0.6575808 0.1820508
SQLE ENSMUSG00000022351 Sqle -1.2502197 0.0000000 -14.1391715
SRD5A3 ENSMUSG00000029233 Srd5a3 0.0355565 0.9386586 0.0274923
STAR ENSMUSG00000031574 Star -0.1712996 0.8121067 -0.0903869
STARD3 ENSMUSG00000018167 Stard3 -0.2927755 0.1854065 -0.7318751
STARD5 ENSMUSG00000046027 Stard5 0.4838561 0.0045057 2.3462366
TM7SF2 ENSMUSG00000024799 Tm7sf2 -2.1669206 0.0000000 -39.1714253
TSPO ENSMUSG00000041736 Tspo 0.0877881 0.8059029 0.0937173
WNT4 ENSMUSG00000036856 Wnt4 1.8713034 0.0014986 2.8243258

Question 29: Does the table look like you expect it to?

  1. Now, let’s fetch the count data for these genes and make a heatmap of all the information we have collected:
# Read in the count data:
tableCounts <- read.table(file="data/tableCounts_with_location.tab", sep="\t", header=TRUE, skip=1)
rownames(tableCounts) <- tableCounts[,1]
tableCounts <- tableCounts[,7:12]
colnames(tableCounts) <- c("KO_1","KO_2", "KO_3", "WT_1", "WT_2", "WT_3");

# Merge count data with the selTab:
selTab <- merge(selTab, tableCounts, by.x="ensembl_gene_id", by.y=0, all.x=T)

# Plot a heatmap:
# Row annotation:
rowann <- list(Significant=ifelse(selTab$FDR<0.05,"yes","no"), Regulation=ifelse(selTab$logFC>0,"Up","Down"))
rowannCol <- list(Significant=c("black","green"), Regulation=c("blue","red"))
# Row labels:
rowlabs <- selTab$mgi_symbol
rowlabs[duplicated(rowlabs)] <- paste(rowlabs[duplicated(rowlabs)]," ")
# Plot:
aheatmap(selTab[,7:12], scale="row", annRow=rowann, annColors=rowannCol, 
         labRow=rowlabs,
         txt=selTab[,7:12])

Note that here we scale the rows so the colors will be visualizing the difference for each gene specifically, but is not comparable across genes. Therefore we also include the actual counts in each heatmap cell.

  1. Make a heatmap without the scaling.

Question 30: Does it look “better”, what is the drawback/benefit?

  1. Change the row-labels to mouse ENSEMBL IDs.

Question 31: Given the count data for these genes, does the differential expression results make sense, as indicated by the column annotation columns? Any specific genes that you are concerned about?
Question 32: Would you say from looking at the heatmap, that in general steroid biosynthesis is down-regulated or up-regulated? Does your answer reflect the GSA results as seen in the network plot above?

  1. Spot-check some of these genes to see what their functions are (Google search or similar).

Question 33: Are these genes actually involved in steroid biosynthesis? Are you confident enough from the data and analysis results to say anything about how YAP/TAZ knockout affects this biological process?

  1. If you have time, do a similar heatmap but for another gene-set of your choice.

pull out and use GO evidence code? fixa toc float? kolla subset av frågor som ska lämnas in? Lämna in kod?

Session info

This page was generated using the following R session:

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.6 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
##  [1] grid      stats4    parallel  stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] RColorBrewer_1.1-2   Rgraphviz_2.18.0     org.Mm.eg.db_3.4.0  
##  [4] NMF_0.20.6           cluster_2.0.5        rngtools_1.2.4      
##  [7] pkgmaker_0.22        registry_0.3         piano_1.14.0        
## [10] biomaRt_2.30.0       topGO_2.26.0         SparseM_1.72        
## [13] GO.db_3.4.0          AnnotationDbi_1.36.0 IRanges_2.8.0       
## [16] S4Vectors_0.12.0     Biobase_2.34.0       graph_1.52.0        
## [19] BiocGenerics_0.20.0  knitr_1.14          
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7        lattice_0.20-34    relations_0.6-6   
##  [4] gtools_3.5.0       assertthat_0.1     digest_0.6.10     
##  [7] foreach_1.4.3      gridBase_0.4-7     slam_0.1-37       
## [10] plyr_1.8.4         chron_2.3-47       RSQLite_1.0.0     
## [13] evaluate_0.10      highr_0.6          ggplot2_2.1.0     
## [16] gplots_3.0.1       data.table_1.9.6   gdata_2.17.0      
## [19] rmarkdown_1.1      sets_1.0-16        BiocParallel_1.8.0
## [22] stringr_1.1.0      igraph_1.0.1       RCurl_1.95-4.8    
## [25] munsell_0.4.3      fgsea_1.0.0        marray_1.52.0     
## [28] htmltools_0.3.5    tibble_1.2         gridExtra_2.2.1   
## [31] codetools_0.2-15   matrixStats_0.51.0 XML_3.98-1.4      
## [34] bitops_1.0-6       xtable_1.8-2       gtable_0.2.0      
## [37] DBI_0.5-1          magrittr_1.5       formatR_1.4       
## [40] scales_0.4.0       KernSmooth_2.23-15 stringi_1.1.2     
## [43] reshape2_1.4.1     doParallel_1.0.10  limma_3.30.0      
## [46] fastmatch_1.0-4    iterators_1.0.8    tools_3.3.1       
## [49] yaml_2.1.13        colorspace_1.2-7   caTools_1.17.1